Performance Bounds for Graphical Record Linkage
Abstract
Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat record linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance.
Similar resources
A Hierarchical Graphical Model for Record Linkage
The task of matching co-referent records is known, among other names, as record linkage. For large record-linkage problems, there is often little or no labeled data available, but the unlabeled data shows a reasonably clear structure. For such problems, unsupervised or semi-supervised methods are preferable to supervised methods. In this paper, we describe a hierarchical graphical model framework for...
Febrl – A Freely Available Record Linkage System with a Graphical User Interface
Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of ...
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...
On Moments of the Concomitants of Classic Record Values and Nonparametric Upper Bounds for the Mean under the Farlie-Gumbel-Morgenstern Model
In a sequence of random variables, record values are observations that exceed or fall below the current extreme value. Now consider a sequence of pairwise random variables {(Xi, Yi), i >= 1}. When the experimenter is interested in studying just the sequence of records of the first component, the second component associated with a record value of the first one is termed the concomitant of that ...
Variational Approximations for Probabilistic Graphical Models
Graphical models are a framework that allows for the incorporation of prior knowledge in a convenient manner. These models are commonly used in a variety of fields, and were found useful in many applications. The computations required in some applications of interest are demanding or even infeasible, and existing approximations are not always sufficient. The focus of this thesis is the developm...